RNA Sequencing in Pulmonary Arterial Hypertension

BMIN503/EPID600 Final Project

Author

Amber Meservey


1 Overview

Pulmonary arterial hypertension (PAH). is a disease characterized by pathologically increased pressures in the pulmonary vascular bed. Despite recent treatment advances, the disease remains associated with considerable morbidity and mortality. Currently disease diagnosis and prognosis relies simultaneously on invasive hemodynamic measurements and crude clinical associations. Accordingly, there is considerable interest in developing biomarkers for use in diagnosing, distinguishing subtype, treating and following treatment response in (PAH). Our group has previously looked at plasma protein biomarkers in a small clinical trial population of participants with pulmonary arterial hypertension. The next phase of our project expands on this prior work by examining RNA sequencing in whole blood samples collected from participants at 2 time points during the trial. I will use this project as an opportunity to work through an RNA sequencing workflow correlating RNA to basic demographic variables (age, BMI, gender) prior to work looking more specific markers of interest (6MWD, TAPSE, REVEAL score). This will allow me to work with the dataset without publishling results of interest.

The two faculty/staff I’ve met with regarding this project are Dr. Steven Kawut in Pulmonary Hypertension and Dr. Rui Feng in Biostatistics. Dr. Kawut would be considered my content expert, as he is a master clinician-researcher in PAH, as well as the lead on this clinical trial. He helps to direct appropriate clinical questions being asked of this data. Dr Rui Feng has expertise in RNA sequencing methods as well as statistical methods informing their use. She has been helpful in helping me understand how to format and use appropriate packages for this data.

https://github.com/ames78/BMIN503_Final_Project/blob/master/BMIN5030_Final_Project2.qmd

2 Introduction

Pulmonary arterial hypertension (PAH) is a disease of pathologically elevated pressures in the pulmonary arterial bed, leading to right heart failure and ~20% mortality at 3 years. Despite availability of several pharmacologic classes of medications for PAH, treatment algorithms remain largely etiology-agnostic and anchored on disease severity as opposed to disease entity. Tools to help discriminate disease etiology and predict treatment response are needed in order to better tailor treatment algorithms and to improve patient outcomes. We anticipate that the investigation of blood-based biomarkers represents an important means to achieving this goal. Our initial step toward exploring candidate biomarkers in PAH will be a retrospective examination of the baseline and change in expression of a subset of candidate protein biomarkers in an available small clinical trial of patients with PAH. Our findings will help direct future larger and more costly studies to test and if appropriate validate potential candidate protein biomarkers.

Following diagnosis of PAH by right heart catheterization (RHC), it is commonly classified by specific etiology, predominately idiopathic PAH (IPAH, ~40% of cases), followed by connective tissue disease-associated PAH (CTD-PAH, ~20% of cases) and PAH associated with several less common systemic diseases. Although all forms of PAH result from aberrant vascular remodeling and vasoconstriction in the pulmonary arteries, the underlying mechanisms are incompletely understood and likely mediated by each specific disease etiology. With a 4:1 predominance in female patients and increasing evidence for a role of adipose tissue in PAH, hormonal and metabolic pathways are also thought to play significant roles in PAH pathophysiology. This complex and heterogenous mechanistic milieu alongside significant mortality, reliance on invasive means of monitoring disease and an untailored treatment armamentarium makes PAH a disease for which investigation of non-invasive biomarkers is both justifiable and urgent.

Thus, I will assess RNA sequencing patterns by several demographic variables of interest including: RNA expression by active treatment arm (as data derived from clinical trial of medication X), sex, age and BMI, all of which could concievably alter RNA expression. Future aims will be to identify patterns of RNA expression by specific PAH etiology, by severity metrics including TAPSE and 6MWD and response to specific treatments.

3 Methods

Overall Approach: Here I will use a linear model to assess whether there is a correlation between RNA expression and treatment arm, adjusting for sex and visit number in the model. I will be adjusting for both of these in all analyses because the clinical trial used a sex stratification sampling method and because there are two different visit numbers. There are other ways having two visit dates for each subject could be evaluated, namely through linear mixed effects modeling. Alternatively, I could look at the outcome and adjust for corresponding baseline expression level via looping each gene. I ultimately did not use these methods for this project as each analysis was taking several hours to run making it computationally difficult. All models will be subjected to Benjamini-Hochberg adjusted p values to account for multiple comparisons. Lastly, I will also intentionally mask specific gene transcript labels for anonymity.

Aim 1: Differential expression analysis by treatment arm (linear model, binary outcome: active drug X vs placebo).

Aim 2: Differential expression analysis by sex (linear model, binary outcome: self-reported male vs female).

Aim 3: Differential expression analysis by age (linear model, continuous outcome: in years).

Aim 4: Differential expression analysis by BMI (linear model, continuous outcome: in kg/m2).

Aim 5: Complete PCA, UMAP, t-SNE for dimensionality reduction and visualization for whichever analysis above returns most DEGs. Color code by variables of interest. Provide Scree plot for variance and loadings for PCs where relevant.

Aim 6: Complete hierarchical clustering analysis using heatmap for visualization purposes on whichever analysis yields most DEGs.

Aim 7: Complete pathway analysis using GO and Reactome databases for whichever aim returns most DEGs. Will allow for looser p values if needed to model pathways.

Aim 8: Complete enrichment analysis using Gene Set Enrichment Analysis on whichever analysis yields the most DEGs.

4 Load Packages


Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'reshape2'


The following object is masked from 'package:tidyr':

    smiths


Loading required package: Matrix


Attaching package: 'Matrix'


The following objects are masked from 'package:tidyr':

    expand, pack, unpack



Attaching package: 'lmerTest'


The following object is masked from 'package:lme4':

    lmer


The following object is masked from 'package:stats':

    step




clusterProfiler v4.12.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/

Please cite:

T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan,
X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal
enrichment tool for interpreting omics data. The Innovation. 2021,
2(3):100141


Attaching package: 'clusterProfiler'


The following object is masked from 'package:purrr':

    simplify


The following object is masked from 'package:stats':

    filter


Loading required package: AnnotationDbi

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: 'BiocGenerics'


The following objects are masked from 'package:lubridate':

    intersect, setdiff, union


The following object is masked from 'package:limma':

    plotMA


The following objects are masked from 'package:dplyr':

    combine, intersect, setdiff, union


The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs


The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min


Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Loading required package: IRanges

Loading required package: S4Vectors


Attaching package: 'S4Vectors'


The following object is masked from 'package:clusterProfiler':

    rename


The following objects are masked from 'package:Matrix':

    expand, unname


The following objects are masked from 'package:lubridate':

    second, second<-


The following object is masked from 'package:tidyr':

    expand


The following objects are masked from 'package:dplyr':

    first, rename


The following object is masked from 'package:utils':

    findMatches


The following objects are masked from 'package:base':

    expand.grid, I, unname



Attaching package: 'IRanges'


The following object is masked from 'package:clusterProfiler':

    slice


The following object is masked from 'package:lubridate':

    %within%


The following object is masked from 'package:purrr':

    reduce


The following objects are masked from 'package:dplyr':

    collapse, desc, slice



Attaching package: 'AnnotationDbi'


The following object is masked from 'package:clusterProfiler':

    select


The following object is masked from 'package:dplyr':

    select




ReactomePA v1.48.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use ReactomePA in published research, please cite:
Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479

Loading required package: viridisLite

5 Load Files

6 Transpose and clean RNA count sheet

## TPM dataframe is currently gene names on rows, kit_number on columns. 
## Transpose so that kit_numbers are the rows and row names. 

TPMcounts_t <- as.data.frame(t(TPMcounts))
## V1/V2 currently column name, gene name in column 1. Kit_number = row names. 
## Replace V1/V2 column name with gene name in column 1. 
colnames(TPMcounts_t) <- TPMcounts_t[1,]
## Remove redundant row with gene name now that its the column title. 
TPMcounts_t <- TPMcounts_t[-1,]

7 Add kit_numbers to demographics

## Currently "RNAdemo" has 84 observations for patients, whereas "RNAids" has 164 observations for 2 visits * 84 patients with some missingness. 
## In order to add kit_number to RNAdemo without losing info, I will duplicate the patient info so as to add kit_numbers corresponding to visit_num

RNAdemodup <- RNAdemo[rep(1:nrow(RNAdemo), each = 2),]
RNAdemodup$visit_num <- rep(c(1,2), nrow(RNAdemo))

## Change visit_num = 2 to 3 to match other coding. 
RNAdemodup$visit_num <- ifelse(RNAdemodup$visit_num == 2, 3, RNAdemodup$visit_num)
## Move visit_num up front. 
RNAdemodup <- RNAdemodup[, c("visit_num", setdiff(names(RNAdemodup), "visit_num"))]

## Only save the variables I plan to work with (too many variables):
# Subset only the desired columns
clininfo <- RNAdemodup[, c("pt", "visit_num", "age_der", "gender", "arm", "BMI")]

clininfo <- merge(clininfo, RNAids[, c("pt", "visit_num", "kit_number")],
                  by = c("pt", "visit_num"), 
                  all.x = TRUE)

clininfo <- clininfo[, c("kit_number", setdiff(names(clininfo), "kit_number"))]

8 Remove kits that didn’t meet QC and NAs (n = 6+ 4= 10). Complete Case Analysis.

## QC - kit numbers that did not meet QC 
kit_numbers_to_remove <- c("1102", "1105", "1110", "1111", "1503", "1607")
clininfo <- clininfo |>
  filter(!kit_number %in% kit_numbers_to_remove)

## Leaves me with 162 rows (6 removed)- 4 NAs- remove since we wont have RNA data for these non-existant kit_numbers. 
sum(is.na(clininfo$kit_number))
[1] 4
clininfo <- clininfo[!is.na(clininfo$kit_number),]
## Leaves 158 obs. 

9 Log transform TPM counts

## Visualize TPM data
# Combine all values into a single numeric vector
all_values <- as.numeric(unlist(TPMcounts_t))
Warning: NAs introduced by coercion
ggplot(data.frame(all_values), aes(x = all_values)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Original TPM Data", x = "TPM Values", y = "Frequency") +
  theme_minimal()
Warning: Removed 57500 rows containing non-finite outside the scale range
(`stat_bin()`).

## Many zeros, add constant 1 to avoid log(0)
## Remove gene name
TPMcounts_t <- TPMcounts_t[-nrow(TPMcounts_t),]

# make numeric for log transformation 
TPMcounts_t[] <- lapply(TPMcounts_t, function(x) as.numeric(as.character(x)))

## log transform TPM to work with limma, stabilize variance
TPMcounts_t_log2 <- log2(TPMcounts_t + 1)


## Visualize log data: 
all_values_log <- as.numeric(unlist(TPMcounts_t_log2))
ggplot(data.frame(all_values_log), aes(x = all_values_log)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Log2 Transformed TPM Distribution", x = "TPM Values", y = "Frequency") +
  theme_minimal()

10 Combine log-TPM RNA expression with clinical info.

## Add kit_number:
## Extract kit_number from row names into a column for merging purposes. 
## Make kit_number the first column. 
TPMcounts_t_log2$kit_number <- rownames(TPMcounts_t_log2)
TPMcounts_t_log2 <- TPMcounts_t_log2[, c("kit_number", setdiff(colnames(TPMcounts_t_log2), "kit_number"))]


# Remove x from kit_number in TPMsheet to match clinical. 
TPMcounts_t_log2$kit_number <- gsub("^X", "", TPMcounts_t_log2$kit_number)

## Merge log- TPM RNA expression with clinical info: 
merged <- merge(clininfo, TPMcounts_t_log2, by = "kit_number")

# Recode arm from A/B to 1/2:
merged$arm <- ifelse(merged$arm == "A", 1, 
                           ifelse(merged$arm == "B", 2, merged$arm))

11 Update Expression Matrix for Use in DEG

## Format logTPMcounts to expression matrix (gene name as row names)
TPMcounts_t_log2_t <- t(TPMcounts_t_log2)
logexpressionmatrix <- TPMcounts_t_log2_t
logexpressionmatrix <- as.matrix(logexpressionmatrix)

12 Results

13 Analysis 1: DEG for Treatment Arm Adjusted by Sex and Visit

There were no significantly differentially expressed RNA transcripts by treatment arm after adjusting for sex and visit.

Warning: Zero sample variances detected, have been offset away from zero

14 Analysis 2: Differential Expression by Sex Adjusted by Visit

There were several differentially expressed RNA transcripts by sex after adjusting for visit number.

## Does expression of any RNA transcripts differ by sex (binary, self-reported male vs. female)?
## Male = 1/Female =2 
## Define Variables of Interest, Analysis DF
treatment <- factor(merged$arm)
gender <- factor(merged$gender)
visit <- factor(merged$visit_num)
analysis2 <- model.matrix (~ gender + visit)
colnames(analysis2) <- c("Intercept", "Gender2vs1", "Visit3vs1")


## Fit Linear Model
fit_g2 <- lmFit(logexpressionmatrix, analysis2)

## Apply Empirical Bayes Moderation: Stabilizes variance 
fit_g2 <- eBayes(fit_g2)
Warning: Zero sample variances detected, have been offset away from zero
## Fit Results
results_g2<- topTable(fit_g2, coef = "Gender2vs1", adjust.method = "BH", number = Inf)

## Filter Significance by Benjamini-Hochberg Correction for Multiple Comparisons
results_g2$Significant <- ifelse(results_g2$adj.P.Val <= 0.05, "Significant", "Not Significant")

## Color Coding:
results_g2$colorcoding <- ifelse(
  results_g2$Significant == "Significant" & results_g2$logFC > 0, "Pink (Overrepresented in Female)",
  ifelse(results_g2$Significant == "Significant" & results_g2$logFC < 0, "Blue (Overrepresented in Male)", "Not Significant")
)

## Volcano Plot for Sex:  
ggplot(results_g2, aes(x = logFC, y = -log10(adj.P.Val), color = colorcoding)) +
  geom_point(alpha = 0.8, size = 2) +
  scale_color_manual(
    values = c(
      "Pink (Overrepresented in Female)" = "hotpink4",
      "Blue (Overrepresented in Male)" = "cornflowerblue",
      "Not Significant" = "gray"
  ) 
  ) +
  labs(
    title = "Differential Expression by Sex",
    x = "Log2 Fold Change", 
    y = "-Log10 BH-Adjusted P-value"
  ) +
  theme_minimal() +
  theme(
    legend.title = element_blank(), 
    legend.position = "right")

15 Aim 3: Differential Expression by Age Adjusted by Sex and Visit

There were 379 differentially expressed RNA transcripts by age after adjusting for sex and visit.

## make sure gene names are carried forward 
genenames <- colnames(merged)[8:57507]
rownames(logexpressionmatrix) <- genenames




## Does expression of any RNA transcripts differ by age (continuous, in years)?
## Define Variables of Interest, Analysis DF
age <- merged$age
gender <- factor(merged$gender)
visit <- factor(merged$visit_num)
analysis3 <- model.matrix (~ age + gender + visit)
colnames(analysis3) <- c("Intercept", "Age", "Gender2vs1", "Visit3vs1")


## Fit Linear Model
fit_h3 <- lmFit(logexpressionmatrix, analysis3)

## Apply Empirical Bayes Moderation: Stabilizes variance 
fit_h3 <- eBayes(fit_h3)
Warning: Zero sample variances detected, have been offset away from zero
## Fit Results
## Dont sort so you can add gene names back
results_h3 <- topTable(fit_h3, coef = "Age", adjust.method = "BH", number = Inf, sort.by = "none")
results_h3$Gene <- rownames(fit_h3)


## Filter Significance by Benjamini-Hochberg Correction for Multiple Comparisons
results_h3$Significant <- ifelse(results_h3$adj.P.Val <= 0.05, "Significant", "Not Significant")

## Color Coding:
results_h3$colorcoding <- ifelse(
  results_h3$Significant == "Significant" & results_h3$logFC > 0, "Pink (Overrepresented in Young)",
  ifelse(results_h3$Significant == "Significant" & results_h3$logFC < 0, "Blue (Overrepresented in Old)", "Not Significant")
)

## Volcano Plot for Age:  
ggplot(results_h3, aes(x = logFC, y = -log10(adj.P.Val), color = colorcoding)) +
  geom_point(alpha = 0.8, size = 2) +
  scale_color_manual(
    values = c(
      "Pink (Overrepresented in Young)" = "hotpink4",
      "Blue (Overrepresented in Old)" = "cornflowerblue",
      "Not Significant" = "gray"
  ) 
  ) +
  labs(
    title = "Differential Expression by Age",
    x = "Log2 Fold Change", 
    y = "-Log10 BH-Adjusted P-value"
  ) +
  theme_minimal() +
  theme(
    legend.title = element_blank(), 
    legend.position = "right")

16 Analysis 4: Differential Expression by BMI Adjusted by Sex and Visit

There were 8 differentially expressed RNA transcripts by BMI after adjusting for sex and visit.

## Does expression of any RNA transcripts differ by BMI (continuous, in kg/m2)?
## Define Variables of Interest, Analysis DF
bmi <- merged$BMI
gender <- factor(merged$gender)
visit <- factor(merged$visit_num)
analysis4 <- model.matrix (~ bmi + gender + visit)
colnames(analysis4) <- c("Intercept", "BMI", "Gender2vs1", "Visit3vs1")

## Fit Linear Model
fit_i4 <- lmFit(logexpressionmatrix, analysis4)

## Apply Empirical Bayes Moderation: Stabilizes variance 
fit_i4 <- eBayes(fit_i4)
Warning: Zero sample variances detected, have been offset away from zero
## Fit Results
results_i4<- topTable(fit_i4, coef = "BMI", adjust.method = "BH", number = Inf)

## Filter Significance by Benjamini-Hochberg Correction for Multiple Comparisons
results_i4$Significant <- ifelse(results_i4$adj.P.Val <= 0.05, "Significant", "Not Significant")

## Color Coding:
results_i4$colorcoding <- ifelse(
  results_i4$Significant == "Significant" & results_i4$logFC > 0, "Pink (Overrepresented in Lower BMIs)",
  ifelse(results_i4$Significant == "Significant" & results_i4$logFC < 0, "Blue (Overrepresented in Higher BMIs)", "Not Significant")
)

## Volcano Plot for Age:  
ggplot(results_i4, aes(x = logFC, y = -log10(adj.P.Val), color = colorcoding)) +
  geom_point(alpha = 0.8, size = 2) +
  scale_color_manual(
    values = c(
      "Pink (Overrepresented in Lower BMIs)" = "hotpink4",
      "Blue (Overrepresented in Higher BMIs)" = "cornflowerblue",
      "Not Significant" = "gray"
  ) 
  ) +
  labs(
    title = "Differential Expression by BMI",
    x = "Log2 Fold Change", 
    y = "-Log10 BH-Adjusted P-value"
  ) +
  theme_minimal() +
  theme(
    legend.title = element_blank(), 
    legend.position = "right")

17 Principal Components Analysis for All Datapoints

PCA, UMAP and t-SNE were all used on the significant age gene transcript sets in order to visualize trends. Color coding was used to visualize how variables of interest related to dimensionality vectors. Age demonstrated the most visually compelling trend, as would be expected on age dataset.

## PCA for Age: 
## Extract significantly different age (3325 sig genes)
sigagegenes <- results_h3[results_h3$Significant == "Significant", ]
sigagegenenames <- rownames(sigagegenes)
sigagematrix <- logexpressionmatrix[sigagegenenames, ]
sigagematrix <- t(sigagematrix)

## Center and scale:
PCAcolumns1<- sigagematrix
PCAresults1 <- prcomp(PCAcolumns1, center = TRUE, scale = TRUE)
PCAscores1 <- as.data.frame(PCAresults1$x)

ggplot(PCAscores1, aes(x = PC1, y = PC2)) +
  geom_point(color = "cornflowerblue", alpha = 0.7) +
  labs(title = "PCA Scatterplot (PC1 vs PC2)", x = "PC1", y = "PC2") +
  theme_minimal()

variance1 <- summary(PCAresults1)$importance[2, ] * 100

barplot(
  variance1,
  names.arg = paste0("PC", seq_along(variance1)),
  main = "Scree Plot",
  xlab = "Principal Components",
  ylab = "Percentage of Variance Explained",
  col = "cornflowerblue",
  las = 2
)

loadings1 <- as.data.frame(PCAresults1$rotation)
ggplot(loadings1, aes(x = rownames(loadings1), y = PC1)) +
  geom_bar(stat = "identity", fill = "darkorange") +
  labs(title = "Variable Loadings for PC1", x = "Variables1", y = "Loading1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

loadings1 <- as.data.frame(PCAresults1$rotation)
ggplot(loadings1, aes(x = rownames(loadings1), y = PC2)) +
  geom_bar(stat = "identity", fill = "cyan4") +
  labs(title = "Variable Loadings for PC2", x = "Variables1", y = "Loading1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

18 Repeat Principal Component Analysis for Age, Followup Visit Only

# Remove "X" prefix from logexpressionmatrix column names for matching
colnames(logexpressionmatrix) <- sub("^X", "", colnames(logexpressionmatrix))
# Filter kit_numbers for visit_num = 1 and visit_num = 3
kitnumbersvisit1 <- merged$kit_number[merged$visit_num == 1]
kitnumbersvisit3 <- merged$kit_number[merged$visit_num == 3]
logexpressionvisit1 <- logexpressionmatrix[, colnames(logexpressionmatrix) %in% kitnumbersvisit1]
logexpressionvisit3 <- logexpressionmatrix[, colnames(logexpressionmatrix) %in% kitnumbersvisit3]

# Followup visit only
kitnumbersvisit3 <- merged$kit_number[merged$visit_num == 3]
merged_subset <- merged[merged$kit_number %in% kitnumbersvisit3, ]


## Does expression of any RNA transcripts differ by age (continuous, in years)?
# Create design matrix using the subset of merged
age <- merged_subset$age
gender <- factor(merged_subset$gender)
analysis5 <- model.matrix(~ age + gender)
colnames(analysis5) <- c("Intercept", "Age", "Gender2vs1")

fit_h5 <- lmFit(logexpressionvisit3, analysis5)

## Apply Empirical Bayes Moderation: Stabilizes variance 
fit_h5 <- eBayes(fit_h5)
Warning: Zero sample variances detected, have been offset away from zero
## Fit Results
results_h5<- topTable(fit_h5, coef = "Age", adjust.method = "BH", number = Inf)

## Filter Significance by Benjamini-Hochberg Correction for Multiple Comparisons
results_h5$Significant <- ifelse(results_h5$adj.P.Val <= 0.05, "Significant", "Not Significant")

## Color Coding:
results_h5$colorcoding <- ifelse(
  results_h5$Significant == "Significant" & results_h5$logFC > 0, "Pink (Overrepresented in Young)",
  ifelse(results_h5$Significant == "Significant" & results_h5$logFC < 0, "Blue (Overrepresented in Old)", "Not Significant")
)

## Volcano Plot for Age:  
ggplot(results_h5, aes(x = logFC, y = -log10(adj.P.Val), color = colorcoding)) +
  geom_point(alpha = 0.8, size = 2) +
  scale_color_manual(
    values = c(
      "Pink (Overrepresented in Young)" = "hotpink4",
      "Blue (Overrepresented in Old)" = "cornflowerblue",
      "Not Significant" = "gray"
  ) 
  ) +
  labs(
    title = "Differential Expression by Age",
    x = "Log2 Fold Change", 
    y = "-Log10 BH-Adjusted P-value"
  ) +
  theme_minimal() +
  theme(
    legend.title = element_blank(), 
    legend.position = "right")

## PCA for Age: 
## Extract significantly different age
sigagegenes1 <- results_h5[results_h5$Significant == "Significant", ]
sigagegenenames1 <- rownames(sigagegenes1)
sigagematrix1 <- logexpressionvisit3[sigagegenenames1, ]
PCAcolumns2 <- t(sigagematrix1)  # Transpose the matrix
PCAresults2 <- prcomp(PCAcolumns2, center = TRUE, scale = TRUE)
PCAscores2 <- as.data.frame(PCAresults2$x)

ggplot(PCAscores2, aes(x = PC1, y = PC2)) +
  geom_point(color = "cornflowerblue", alpha = 0.7) +
  labs(title = "PCA Scatterplot (PC1 vs PC2)", x = "PC1", y = "PC2") +
  theme_minimal()

variance2 <- summary(PCAresults2)$importance[2, ] * 100

barplot(
  variance2,
  names.arg = paste0("PC", seq_along(variance2)),
  main = "Scree Plot",
  xlab = "Principal Components",
  ylab = "Percentage of Variance Explained",
  col = "cornflowerblue",
  las = 2
)

loadings2 <- as.data.frame(PCAresults2$rotation)
ggplot(loadings2, aes(x = factor(rownames(loadings2)), y = PC1)) +
  geom_bar(stat = "identity", fill = "darkorange") +
  labs(title = "Variable Loadings for PC1", x = "Variables", y = "Loading") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(loadings2, aes(x = factor(rownames(loadings2)), y = PC2)) +
  geom_bar(stat = "identity", fill = "lightyellow") +
  labs(title = "Variable Loadings for PC2", x = "Variables", y = "Loading") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Color the PCA by Age for easier visualization 
# Extract the age variable for the same samples
sampleage <- rownames(PCAcolumns2)  # Samples in PCA
age_sub <- merged$age[match(sampleage, merged$kit_number)]  # Match age by sample names
PCAscores2 <- as.data.frame(PCAresults2$x)
PCAscores2$age <- age_sub
ggplot(PCAscores2, aes(x = PC1, y = PC2, color = age)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "PCA Scatterplot (Colored by Age)",
    x = "PC1",
    y = "PC2",
    color = "Age"
  ) +
  theme_minimal()

# Create age groups
PCAscores2$age_group <- ifelse(PCAscores2$age < 50, "Under 50", "Over 50")
ggplot(PCAscores2, aes(x = PC1, y = PC2, color = age_group)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(values = c("Under 50" = "blue", "Over 50" = "red")) +
  labs(
    title = "PCA Scatterplot (Under 50 vs. Over 50)",
    x = "PC1",
    y = "PC2",
    color = "Age Group"
  ) +
  theme_minimal()

19 UMAP for Age, Followup Visit Only

umapvisit3<- umap(PCAcolumns2)
umapvisit3results <- umap::umap(PCAcolumns2, n_neighbors = 15, min_dist = 0.1, metric = "euclidean")
umapvisit3df <- as.data.frame(umapvisit3results$layout)
colnames(umapvisit3df) <- c("UMAP1", "UMAP2")
ggplot(umapvisit3df, aes(x = UMAP1, y = UMAP2, color = PCAscores2$age)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "UMAP Visualization (Colored by Age)",
    x = "UMAP1",
    y = "UMAP2",
    color = "Age"
  ) +
  theme_minimal()

20 UMAP on Entire Dataset

## UMAP on full dataset, colored by various conditions
joined <- merged[merged$kit_number %in% colnames(logexpressionmatrix), ]
joined <- joined[match(colnames(logexpressionmatrix), joined$kit_number), ]
log_t <- t(logexpressionmatrix)
umapall<- umap(log_t, n_neighbors = 15, min_dist = 0.1, metric = "euclidean")

umap_df <- as.data.frame(umapall$layout)
colnames(umap_df) <- c("UMAP1", "UMAP2")
umap_df$kit_number <- joined$kit_number  
umap_df <- cbind(umap_df, joined)
umap_df <- umap_df[, !duplicated(colnames(umap_df))]



## Age
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = age_der)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "UMAP- Entire Dataset",
    x = "UMAP1",
    y = "UMAP2",
    color = "Age"
  ) +
  theme_minimal()

## Arm 
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = arm)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(
    values = c("1" = "cornflowerblue", "2" = "darkorange")
  ) +
  labs(
    title = "UMAP- Entire Dataset",
    x = "UMAP1",
    y = "UMAP2",
    color = "Treatment Arm"
  ) +
  theme_minimal()

## Visit Number
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = visit_num)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "UMAP- Entire Dataset",
    x = "UMAP1",
    y = "UMAP2",
    color = "Visit Number"
  ) +
  theme_minimal()

## Gender 
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = gender)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "UMAP- Entire Dataset",
    x = "UMAP1",
    y = "UMAP2",
    color = "gender"
  ) +
  theme_minimal()

## BMI 
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = BMI)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "UMAP- Entire Dataset",
    x = "UMAP1",
    y = "UMAP2",
    color = "BMI"
  ) +
  theme_minimal()

21 t-SNE on Entire Dataset

set.seed(1)
tsne <- Rtsne(log_t, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500)
Performing PCA
Read the 158 x 50 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.01 seconds (sparsity = 0.745313)!
Learning embedding...
Iteration 50: error is 56.087173 (50 iterations in 0.01 seconds)
Iteration 100: error is 60.914188 (50 iterations in 0.01 seconds)
Iteration 150: error is 63.351929 (50 iterations in 0.01 seconds)
Iteration 200: error is 61.041334 (50 iterations in 0.01 seconds)
Iteration 250: error is 61.019561 (50 iterations in 0.01 seconds)
Iteration 300: error is 1.469315 (50 iterations in 0.01 seconds)
Iteration 350: error is 0.973240 (50 iterations in 0.01 seconds)
Iteration 400: error is 0.810475 (50 iterations in 0.01 seconds)
Iteration 450: error is 0.711370 (50 iterations in 0.01 seconds)
Iteration 500: error is 0.645050 (50 iterations in 0.01 seconds)
Fitting performed in 0.09 seconds.
tsne_df <- as.data.frame(tsne$Y)
colnames(tsne_df) <- c("tSNE1", "tSNE2")
tsne_df <- cbind(tsne_df, umap_df[, c("arm", "age_der", "visit_num", "gender", "BMI")])
ggplot(tsne_df, aes(x = tSNE1, y = tSNE2, color = arm)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(values = c("1" = "cornflowerblue", "2" = "darkorange")) +
  labs(
    title = "t-SNE- Entire Dataset",
    x = "t-SNE1",
    y = "t-SNE2",
    color = "Arm"
  ) +
  theme_minimal()

ggplot(tsne_df, aes(x = tSNE1, y = tSNE2, color = age_der)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "t-SNE - Entire Dataset",
    x = "t-SNE1",
    y = "t-SNE2",
    color = "Age"
  ) +
  theme_minimal()

ggplot(tsne_df, aes(x = tSNE1, y = tSNE2, color = BMI)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "t-SNE - Entire Dataset",
    x = "t-SNE1",
    y = "t-SNE2",
    color = "BMI"
  ) +
  theme_minimal()

22 Pathway Analysis for Age

Several immunologically related pathways were differentially expressed by age in participants with PAH.

sigagegenes<- results_h3[results_h3$Significant == "Significant", ]
entrez <- bitr(
  sigagegenes$Gene,
  fromType = "ENSEMBL",
  toType = "ENTREZID",
  OrgDb = org.Hs.eg.db
)
'select()' returned 1:many mapping between keys and columns
Warning in bitr(sigagegenes$Gene, fromType = "ENSEMBL", toType = "ENTREZID", :
23.91% of input gene IDs are fail to map...
sigagegenes1 <- merge(sigagegenes, entrez, by.x = "Gene", by.y = "ENSEMBL", all.x = TRUE)

## create new df without NA
sigagegenes2 <- sigagegenes1[!is.na(sigagegenes1$ENTREZID), ]
gopathanalysis <- enrichGO(
  gene = sigagegenes2$ENTREZID, 
  OrgDb = org.Hs.eg.db, 
  keyType = "ENTREZID",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)

dotplot(gopathanalysis, 
        showCategory = 8,            
        title = "GO Pathway Analysis for Age") +
  theme_minimal() + 
  theme(
    legend.text = element_text(size = 6),
    legend.title = element_text(size = 6),
    axis.text = element_text(size = 6)
  )

23 Followup Age in Reactome

No significant pathways were demonstrated in the reactome database.

sigagegenes1 <- results_h5[results_h5$Significant == "Significant", ]
sigagegenenames1 <- rownames(sigagegenes1)
sigagegenes1$Gene <- rownames(sigagegenes1)
entrez <- bitr(
  sigagegenes1$Gene,
  fromType = "ENSEMBL",
  toType = "ENTREZID",
  OrgDb = org.Hs.eg.db
)
'select()' returned 1:many mapping between keys and columns
Warning in bitr(sigagegenes1$Gene, fromType = "ENSEMBL", toType = "ENTREZID", :
26.65% of input gene IDs are fail to map...
sigagegenes1 <- merge(sigagegenes1, entrez, by.x = "Gene", by.y = "ENSEMBL", all.x = TRUE)

## create new df without NA
sigagegenes2 <- sigagegenes1[!is.na(sigagegenes1$ENTREZID), ]
repathanalysis <- enrichPathway(
  gene = sigagegenes2$ENTREZID, 
  organism = "human",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)

if (is.null(repathanalysis) || nrow(repathanalysis) == 0) {
  print("No significantly enriched pathways.")} else {print("Yes")}
[1] "No significantly enriched pathways."

24 Heatmap for Age

Demonstrates hierarchical clustering for those transcripts differentially expressed by age in participants with PAH. Number of transcripts shown both overall at 379 as well as the top 10 significant age related genes for ease of visualization. Gene labels removed for anonymity.

## Extract Genes from Prior Analysis
valid_genes <- sigagegenenames1[sigagegenenames1 %in% rownames(logexpressionmatrix)]
valid_genes <- valid_genes[valid_genes %in% rownames(logexpressionmatrix)]
expressheat <- logexpressionmatrix[valid_genes, ]
expressheat <- as.matrix(expressheat)

## All 379 Significant Age Genes
pheatmap(
  expressheat,                 
  cluster_rows = TRUE,         
  cluster_cols = TRUE,         
  scale = "row",        
  show_rownames = FALSE,
  show_colnames = FALSE,
  main = "Heatmap of 379 Significant Age Related Genes"
)

# Top 10 Genes only 
valid10 <- valid_genes[1:10]
expressheat10 <- logexpressionmatrix[valid10, ]
expressheat10 <- as.matrix(expressheat10)  
pheatmap(
  expressheat10,           
  cluster_rows = TRUE,         
  cluster_cols = TRUE,        
  scale = "row",         
  show_rownames = FALSE,
  show_colnames = FALSE,
  main = "Heatmap of Top 10 Significant Age Related Genes"
)

25 Functional Gene Enrichment Analysis for Age

There is significant enrichment for the pathway “Xenobiotic Metabolism” in the significantly differentially expressed RNA transcripts by age.

## obtain list of significant age related genes from prior analysis, make sure gene label is present and sort
one <- results_h5[results_h5$Significant == "Significant", ]
one$Gene <- rownames(one)
rankage <- one$logFC
names(rankage) <- one$Gene
rankage <- sort(rankage, decreasing = TRUE)

## update such that all symbols match ensembl
convert<- bitr(
  names(rankage),      
  fromType = "ENSEMBL", 
  toType = "SYMBOL",    
  OrgDb = org.Hs.eg.db
)
'select()' returned 1:many mapping between keys and columns
Warning in bitr(names(rankage), fromType = "ENSEMBL", toType = "SYMBOL", :
26.65% of input gene IDs are fail to map...
rankage <- rankage[names(rankage) %in% convert$ENSEMBL] 
names(rankage) <- convert$SYMBOL[match(names(rankage), convert$ENSEMBL)]

## pathways 
pathways <- msigdbr(species = "Homo sapiens", category = "H")
pathway2 <- split(pathways$gene_symbol, pathways$gs_name)
fgseaage <- fgseaMultilevel(
  pathways = pathway2,
  stats = rankage,
  minSize = 1,
  maxSize = 5000
)

fgseaage_sig <- fgseaage[fgseaage$padj < 0.5, ]


toppathway <- fgseaage_sig[1, "pathway"]
toppathway <- as.character(fgseaage_sig[1, "pathway"])
plotEnrichment(pathway2[[toppathway]], rankage) +
  labs(title = paste("Enrichment for", toppathway))

26 Conclusion

After log-transforming TPM RNA sequencing transcripts and adjusting for sex and visit number, age generated the highest number of differentially expressed genes within a population of clinical trial participants with pulmonary arterial hypertension (PAH). Treatment arm did not demonstrate significantly deferentially expressed genes. Sex (adjusted only for visit number) and BMI demonstrated comparatively fewer deferentially expressed RNA transcripts. Those RNA transcripts that were significantly up or downregulated in the age differential expression analysis using a Benjamini Hochberg approach to account for multiple comparisons were then analyzed and/or subjected to several commonly employed visualization methods in RNA seq projects. I first completed a principal components analysis on both the entire dataset and just the followup set. The loadings for these would allow one to identify specific transcripts of interest. I then generated PCAs, UMAPs and t-SNEs for dimensionality reduction and color coded by variables of interest to visualize how these may or may not relate to the vectors yielded by various techniques. A pathway analysis revealed that age related transcripts were most heavily related to immunologic biologic pathways. A heatmap helped to visualize hierarchical clustering within age pathways for participants with PAH. Lastly, a gene enrichment analysis showed enrichment of age-related RNA sequencing transcripts within the “Xenobiotic Metabolism” pathway.